1. Introduction

The goal of this project is to analyze and estimate the factors associated with obesity. Obesity is one of the major health problems of the current generation and has been growing steadily. Many scientific publications over the past few years have concentrated on the factors included in this project’s dataset, which is also widely used in industry to build better tools.

We chose this research topic after analyzing the facts published by the WHO about obesity [2].

The source of our research project is the paper published by Fabio Mendoza Palechor, which presents data for the estimation of obesity levels in individuals from Mexico, Peru and Colombia [1]; we acquired the dataset from the UCI Machine Learning Repository [3]. The dataset contains numerical, ordinal and categorical attributes and has sufficient observations for a regression study, fulfilling this course’s requirement. It comprises 2,111 records with 17 variables, and is distributed as comma-separated values (CSV) as well as in Attribute-Relation File Format (ARFF) for use with the Weka machine learning software.

The dataset can be accessed and downloaded from the UCI Machine Learning Repository [3].

Why are you creating a model for this data?

Although predicting an individual’s weight might seem an uncommon or even exceptional ask, estimating it can prove beneficial in certain cases:

  • Imputing weight: the model provides a sound basis for imputing a person’s weight when the other dietary and social predictors are present.

  • Commercial and industrial users could still estimate the weights of their customers if, for some reason, HIPAA considerations allow only the predictors selected in our model to be available. Airline, marketing and pharmaceutical companies would not be precluded from conducting studies, tests and proof-of-concept experiments when weight is absent but otherwise vital for such endeavors.

What is the goal of this model?

The goal of the model is to estimate the average weight of an individual given information about the individual’s dietary, social and other physical characteristics.

Primarily, our aim is to explore the topics below, though we are not limited to them.

In addition to predicting weight, our research metrics about obesity and body mass index (BMI) will be compared with the baselines published by Mendoza Palechor [1] and the WHO [2].

This project report covers data loading, exploratory analysis, data cleaning, model building and diagnostics, results and discussion.

Load dataset

obesity=readRDS("data/obesity.Rda")
dim(obesity)
## [1] 2111   17
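Since the introduction mentions comparing obesity and BMI metrics against WHO baselines, here is a quick sketch of that computation (BMI = weight in kg divided by squared height in m; the WHO cut-offs below are standard, but this snippet is illustrative and not part of the modeling pipeline):

```r
# Derive BMI from the two measurements already in the data. We deliberately do
# NOT add it as a model predictor, since it is computed directly from the
# response variable (Weight).
bmi = obesity$Weight / obesity$Height^2
table(cut(bmi,
          breaks = c(0, 18.5, 25, 30, Inf),
          labels = c("Underweight", "Normal", "Overweight", "Obese")))
```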

Provided predictors

str(obesity)
## 'data.frame':    2111 obs. of  17 variables:
##  $ Gender                        : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 2 2 2 ...
##  $ Age                           : num  21 21 23 27 22 29 23 22 24 22 ...
##  $ Height                        : num  1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
##  $ Weight                        : num  64 56 77 87 89.8 53 55 53 64 68 ...
##  $ family_history_with_overweight: Factor w/ 2 levels "yes","no": 1 1 1 2 2 2 1 2 1 1 ...
##  $ FAVC                          : Factor w/ 2 levels "yes","no": 2 2 2 2 2 1 1 2 1 1 ...
##  $ FCVC                          : num  2 3 2 3 2 2 3 2 3 2 ...
##  $ NCP                           : Factor w/ 4 levels "1","2","3","4": 3 3 3 3 1 3 3 3 3 3 ...
##  $ CAEC                          : Factor w/ 4 levels "no","Sometimes",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ SMOKE                         : Factor w/ 2 levels "yes","no": 2 1 2 2 2 2 2 2 2 2 ...
##  $ CH2O                          : num  2 3 2 2 2 2 2 2 2 2 ...
##  $ SCC                           : Factor w/ 2 levels "yes","no": 2 1 2 2 2 2 2 2 2 2 ...
##  $ FAF                           : num  0 3 2 2 0 0 1 3 1 1 ...
##  $ TUE                           : num  1 0 1 0 0 0 0 0 1 1 ...
##  $ CALC                          : Factor w/ 4 levels "no","Sometimes",..: 1 2 3 3 2 2 2 2 3 1 ...
##  $ MTRANS                        : Factor w/ 5 levels "Automobile","Motorbike",..: 4 4 4 5 4 1 2 4 4 4 ...
##  $ NObeyesdad                    : Factor w/ 7 levels "Insufficient_Weight",..: 2 2 2 3 4 2 2 2 2 2 ...

Summary of each variable

summary(obesity)
##     Gender          Age           Height         Weight     
##  Female:1043   Min.   :14.0   Min.   :1.45   Min.   : 39.0  
##  Male  :1068   1st Qu.:19.9   1st Qu.:1.63   1st Qu.: 65.5  
##                Median :22.8   Median :1.70   Median : 83.0  
##                Mean   :24.3   Mean   :1.70   Mean   : 86.6  
##                3rd Qu.:26.0   3rd Qu.:1.77   3rd Qu.:107.4  
##                Max.   :61.0   Max.   :1.98   Max.   :173.0  
##                                                             
##  family_history_with_overweight  FAVC           FCVC      NCP     
##  yes:1726                       yes:1866   Min.   :1.00   1: 395  
##  no : 385                       no : 245   1st Qu.:2.00   2: 285  
##                                            Median :2.39   3:1362  
##                                            Mean   :2.42   4:  69  
##                                            3rd Qu.:3.00           
##                                            Max.   :3.00           
##                                                                   
##          CAEC      SMOKE           CH2O       SCC            FAF       
##  no        :  51   yes:  44   Min.   :1.00   yes:  96   Min.   :0.000  
##  Sometimes :1765   no :2067   1st Qu.:1.58   no :2015   1st Qu.:0.124  
##  Frequently: 242              Median :2.00              Median :1.000  
##  Always    :  53              Mean   :2.01              Mean   :1.010  
##                               3rd Qu.:2.48              3rd Qu.:1.667  
##                               Max.   :3.00              Max.   :3.000  
##                                                                        
##       TUE                CALC                        MTRANS    
##  Min.   :0.000   no        : 639   Automobile           : 457  
##  1st Qu.:0.000   Sometimes :1401   Motorbike            :  11  
##  Median :0.625   Frequently:  70   Bike                 :   7  
##  Mean   :0.658   Always    :   1   Public_Transportation:1580  
##  3rd Qu.:1.000                     Walking              :  56  
##  Max.   :2.000                                                 
##                                                                
##                NObeyesdad 
##  Insufficient_Weight:272  
##  Normal_Weight      :287  
##  Overweight_Level_I :290  
##  Overweight_Level_II:290  
##  Obesity_Type_I     :351  
##  Obesity_Type_II    :297  
##  Obesity_Type_III   :324


2. Methods

2.1. Exploratory Data Analysis

Observe Collinearity Issues Among Variables

obesity.pairs = obesity %>% dplyr::select(-Weight)
pairs.panels(obesity.pairs)

# set aside the numeric predictors
obesity_numeric = subset(obesity, select = c("Age", "Height", "NCP", "CH2O", "FAF", "TUE"))
# present the correlation table
kable(cor(obesity_numeric[sapply(obesity_numeric, function(x) !is.factor(x))], method = "pearson"), digits = 2, "html", 
      caption = "") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = T, font_size = 12) 
         Age  Height   CH2O    FAF    TUE
Age     1.00   -0.03  -0.05  -0.14  -0.30
Height -0.03    1.00   0.21   0.29   0.05
CH2O   -0.05    0.21   1.00   0.17   0.01
FAF    -0.14    0.29   0.17   1.00   0.06
TUE    -0.30    0.05   0.01   0.06   1.00

The pair plot above broadly shows the correlation between all the predictors in the obesity dataset.

Data distribution by Age, Gender and Mode of transportation

obesity=obesity%>%mutate(ageBins = cut(Age, breaks = seq(0,100,by=10)))
age_agg=obesity %>% count(ageBins,Gender)
ggplot(age_agg, aes(ageBins, n, fill = Gender)) +     
  geom_col(position = 'dodge')+
    xlab("Age Group")+
    ylab("Frequency")+
    ggtitle("Age group and Frequency comparison by Gender") 

The plot shows that the majority of observations fall in the 10–40 age range, with the 20–30 group having the most records.

age_mot_agg=obesity %>% count(ageBins,MTRANS)
ggplot(age_mot_agg, aes(fill=MTRANS,y=n, x=ageBins)) + 
    geom_bar(position="stack", stat="identity") +
    scale_fill_viridis(discrete = T) +
    xlab("Age Group")+
    ylab("Frequency")+
    ggtitle("Age group and Frequency comparison by Mode of Transport")

The graph shows an interesting pattern in mode of transportation: the majority of people under 30 prefer public transport, whereas those above 30 prefer the automobile.

Analysis of attributes related to eating habits:

  • Frequent consumption of high caloric food (FAVC)

  • Frequency of consumption of vegetables (FCVC)

  • Number of main meals (NCP)

  • Consumption of food between meals (CAEC)

  • Consumption of water daily (CH2O)

  • Consumption of alcohol (CALC)

p1 = ggboxplot(obesity, x = "family_history_with_overweight", y = "Weight",
    color = "family_history_with_overweight", palette = "jco")
p2 = ggboxplot(obesity, x = "Gender", y = "Weight",
    color = "Gender", palette = "jco")
p3=ggboxplot(obesity, x = "MTRANS", y = "Weight",
    color = "MTRANS", palette = "jco")
p4=ggboxplot(obesity, x = "FAVC", y = "Weight",
    color = "FAVC", palette = "jco")+
    xlab("FAVC")
p5=ggboxplot(obesity, x = "NCP", y = "Weight",
    color = "NCP", palette = "jco")+
    xlab("NCP")
p6=ggboxplot(obesity, x = "CAEC", y = "Weight",
    color = "CAEC", palette = "jco")+
    xlab("CAEC")
p7=ggboxplot(obesity, x = "SMOKE", y = "Weight",
    color = "SMOKE", palette = "jco")+
    xlab("SMOKE")
ggarrange(p1,p2,p3,nrow=3,ncol=1)

ggarrange(p4,p5,p6,p7,nrow=2,ncol=2)

The plots show that the categorical variables related to eating habits have a relationship with Weight.

ggplot(obesity, aes(x=Age, color=Gender)) +
  geom_histogram(fill="white", position="dodge")+
  theme(legend.position="top") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p1=ggplot(data = obesity) +
  geom_point(mapping = aes(x = Height, y = Weight,color=NCP))+
  ggtitle("Height vs Weight by Number of main meals (NCP) ")
p2=ggplot(data = obesity) +
  geom_point(mapping = aes(x = Height, y = Weight,color=SMOKE))+
  ggtitle("Height vs Weight by Smoking habit")
ggarrange(p1,p2,nrow=2,ncol=1)

ggplot(data = obesity) +
  geom_point(mapping = aes(x = Height, y = Weight,color=FAVC))+
  ggtitle("Height vs Weight by Frequent consumption of high caloric food")

Attributes related to physical condition are:

  • Calories consumption monitoring (SCC)
ggboxplot(obesity, x = "SCC", y = "Weight",
    color = "SCC", palette = "jco")

The plot above suggests that people who monitor their calorie consumption are more likely to be fit.


2.2. Data Cleaning

Coerce Categorical Variables to Factors

Having identified the columns and their descriptions, we now coerce the variables that need to be factors, to aid our modeling later.

load_cleansed = function(data = obesity){
  # recode every categorical column to an integer-coded factor
  factor_cols = c("Gender", "family_history_with_overweight", "FAVC", "FCVC",
                  "NCP", "CAEC", "SMOKE", "SCC", "FAF", "TUE", "CALC",
                  "MTRANS", "CH2O", "NObeyesdad")
  for (col in factor_cols) {
    data[[col]] = as.factor(as.integer(data[[col]]))
  }
  data
}
obesity = load_cleansed(obesity)
str(obesity)
## 'data.frame':    2111 obs. of  18 variables:
##  $ Gender                        : Factor w/ 2 levels "1","2": 1 1 2 2 2 2 1 2 2 2 ...
##  $ Age                           : num  21 21 23 27 22 29 23 22 24 22 ...
##  $ Height                        : num  1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
##  $ Weight                        : num  64 56 77 87 89.8 53 55 53 64 68 ...
##  $ family_history_with_overweight: Factor w/ 2 levels "1","2": 1 1 1 2 2 2 1 2 1 1 ...
##  $ FAVC                          : Factor w/ 2 levels "1","2": 2 2 2 2 2 1 1 2 1 1 ...
##  $ FCVC                          : Factor w/ 3 levels "1","2","3": 2 3 2 3 2 2 3 2 3 2 ...
##  $ NCP                           : Factor w/ 4 levels "1","2","3","4": 3 3 3 3 1 3 3 3 3 3 ...
##  $ CAEC                          : Factor w/ 4 levels "1","2","3","4": 2 2 2 2 2 2 2 2 2 2 ...
##  $ SMOKE                         : Factor w/ 2 levels "1","2": 2 1 2 2 2 2 2 2 2 2 ...
##  $ CH2O                          : Factor w/ 3 levels "1","2","3": 2 3 2 2 2 2 2 2 2 2 ...
##  $ SCC                           : Factor w/ 2 levels "1","2": 2 1 2 2 2 2 2 2 2 2 ...
##  $ FAF                           : Factor w/ 4 levels "0","1","2","3": 1 4 3 3 1 1 2 4 2 2 ...
##  $ TUE                           : Factor w/ 3 levels "0","1","2": 2 1 2 1 1 1 1 1 2 2 ...
##  $ CALC                          : Factor w/ 4 levels "1","2","3","4": 1 2 3 3 2 2 2 2 3 1 ...
##  $ MTRANS                        : Factor w/ 5 levels "1","2","3","4",..: 4 4 4 5 4 1 2 4 4 4 ...
##  $ NObeyesdad                    : Factor w/ 7 levels "1","2","3","4",..: 2 2 2 3 4 2 2 2 2 2 ...
##  $ ageBins                       : Factor w/ 10 levels "(0,10]","(10,20]",..: 3 3 3 3 3 3 3 3 3 3 ...

We also experimented with adjusting some of the categorical variables, particularly MTRANS and NObeyesdad, to reduce the number of factor levels; however, in testing we found that this did not improve our results, so we kept the original values for those variables.
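The kind of level reduction we experimented with can be sketched as follows (the threshold of 100 observations and the "Other" label are illustrative assumptions, not the exact rule we used):

```r
# Illustrative only: collapse rare MTRANS levels (fewer than 100 observations)
# into a single "Other" level before refitting a model.
mtrans_chr = as.character(obesity$MTRANS)
rare = names(which(table(mtrans_chr) < 100))
mtrans_chr[mtrans_chr %in% rare] = "Other"
mtrans_collapsed = factor(mtrans_chr)
table(mtrans_collapsed)
```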

2.3. Common functions to perform model diagnostics

Split dataset test and train

set.seed(04082021)  # set the seed to make the partition reproducible
## use 75% of the sample for training
smp_size <- floor(0.75 * nrow(obesity))
trn_ind <- sample(seq_len(nrow(obesity)), size = smp_size)
obs_trn <- obesity[trn_ind, ]
obs_test <- obesity[-trn_ind, ]

Useful functions

#' Generate plots for regression assumptions - Linearity, Normality and Equal variance
#' i.e. Fitted vs Residuals, QQ plot, residuals histogram
#' @param model The fitted model to plot

plot_diagnostics=function (model){
  par(mfrow = c(2, 2))
  plot(fitted(model), resid(model), pch = 20, cex.main = 1, cex.lab = 1,
       main = "Fitted vs Residuals", xlab = "Fitted", ylab = "Residuals")
  abline(h = 0, lty = 2, lwd = 2, col = "darkorange")
  qqnorm(resid(model), main = "Normal Q-Q Plot", col = "darkgrey") 
  qqline(resid(model), col = "dodgerblue", lwd = 2)
  hist(resid(model),
       xlab   = "Residuals",
       main   = "Histogram of Residuals",
       col    = "darkorange",
       border = "dodgerblue",
       breaks = 20)
}

#' Return model performance metrics:
#' R2, Adj R2, Train_RMSE, Test_RMSE, Breusch-Pagan test, Shapiro-Wilk test
#'
#' @param obs_mod A fitted model (required)
#' @param obs_tst Test dataset used to calculate the test RMSE; defaults to the obesity test data

mod_performance=function(obs_mod,obs_tst=obs_test){
  smry=summary(obs_mod)
  #Train RMSE
  trn_rmse=sqrt(mean(resid(obs_mod)^2))
  pred=predict(obs_mod,newdata=obs_tst)
  #test RMSE
  tst_rmse=sqrt(mean((obs_tst$Weight-pred)^2))
  scores=list(r2=smry$r.squared
       ,adjr2=smry$adj.r.squared,
       Train_rmse=trn_rmse,Test_rmse=tst_rmse,
       bptest=bptest(obs_mod)$p.value
       ,sptest=shapiro.test(resid(obs_mod))$p.value)
  
  unlist(scores, recursive = TRUE, use.names = TRUE)
}
bar_plot_summary=function(mod){
  ggplot(mod, aes(y=n, x=ageBins)) + 
    geom_bar(position="stack", stat="identity") +
    scale_fill_viridis(discrete = T) +
    xlab("Age Group")+
    ylab("Frequency")
}

2.4. Model Diagnostics, Performance and Selection

Model1- Additive model with all the parameters

obs_mod_add=lm(Weight~.-NObeyesdad-ageBins,data=obs_trn)
summary(obs_mod_add)

Note: we have removed the derived variable ageBins and the classification label NObeyesdad.

mod_performance(obs_mod_add)

Next, we will look at unusual observations:

Outlier analysis using Cook’s distance

sum(cooks.distance(obs_mod_add) > 4 / length(resid(obs_mod_add)))

Observations whose Cook’s distance exceeds the common 4/n heuristic are flagged as influential outliers.

High-leverage observations using hat values

#We cannot remove records based on leverage, as that would treat critical records as unusual
htval=hatvalues(obs_mod_add)
obs_mod_add_unusual_observations=obs_trn[htval > 2 * mean(htval),]

Next, we explore variable selection methods (AIC, BIC and exhaustive search) to find the best model:

  • AIC search in both directions
  • BIC search in both directions
  • Exhaustive search
obs_mod_add_aic=step(obs_mod_add,direction="both",trace=0)
obs_mod_add_bic=step(obs_mod_add,direction="both",trace=0,k=log(nrow(obs_trn)))
p1=mod_performance(obs_mod_add_aic)
p2=mod_performance(obs_mod_add_bic)
p1=append(p1,unlist(list(num_predictors=length(coef(obs_mod_add_aic)))))
p2=append(p2,unlist(list(num_predictors=length(coef(obs_mod_add_bic)))))
mod1_stats=data.frame(obs_mod_add_aic=p1,obs_mod_add_bic=p2)

Exhaustive search

obs_mod_ex_search = summary(regsubsets(Weight ~ .-NObeyesdad-ageBins, data = obesity))
obs_mod_ex_search$which
##   (Intercept) Gender2   Age Height family_history_with_overweight2 FAVC2 FCVC2
## 1        TRUE   FALSE FALSE  FALSE                            TRUE FALSE FALSE
## 2        TRUE   FALSE FALSE   TRUE                            TRUE FALSE FALSE
## 3        TRUE   FALSE FALSE   TRUE                            TRUE FALSE FALSE
## 4        TRUE   FALSE FALSE   TRUE                            TRUE FALSE FALSE
## 5        TRUE   FALSE FALSE   TRUE                            TRUE FALSE FALSE
## 6        TRUE   FALSE FALSE   TRUE                            TRUE FALSE FALSE
## 7        TRUE   FALSE  TRUE   TRUE                            TRUE FALSE FALSE
## 8        TRUE   FALSE  TRUE   TRUE                            TRUE FALSE FALSE
##   FCVC3  NCP2  NCP3  NCP4 CAEC2 CAEC3 CAEC4 SMOKE2 CH2O2 CH2O3  SCC2  FAF1
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE FALSE FALSE
## 3  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE FALSE FALSE
## 4  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  FALSE FALSE FALSE FALSE FALSE
## 5  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  FALSE FALSE FALSE FALSE FALSE
## 6  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  FALSE FALSE FALSE FALSE FALSE
## 7  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  FALSE FALSE FALSE FALSE FALSE
## 8  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  FALSE FALSE FALSE FALSE FALSE
##    FAF2  FAF3  TUE1  TUE2 CALC2 CALC3 CALC4 MTRANS2 MTRANS3 MTRANS4 MTRANS5
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE   FALSE   FALSE   FALSE
## 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE   FALSE   FALSE   FALSE
## 3 FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE   FALSE   FALSE   FALSE
## 4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE   FALSE   FALSE   FALSE
## 5  TRUE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE   FALSE   FALSE   FALSE
## 6  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE   FALSE   FALSE   FALSE   FALSE
## 7  TRUE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE   FALSE    TRUE   FALSE
## 8  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE   FALSE   FALSE    TRUE   FALSE

Quantify the effect of collinearity using the Variance Inflation Factor (VIF)

kable(vif(obs_mod_add))
Variable                           VIF
Gender2                          2.117
Age                              1.975
Height                           2.115
family_history_with_overweight2  1.356
FAVC2                            1.186
FCVC2                            3.101
FCVC3                            3.250
NCP2                             1.649
NCP3                             1.839
NCP4                             1.330
CAEC2                            6.891
CAEC3                            5.889
CAEC4                            2.226
SMOKE2                           1.077
CH2O2                            1.274
CH2O3                            1.414
SCC2                             1.134
FAF1                             1.233
FAF2                             1.357
FAF3                             1.300
TUE1                             1.209
TUE2                             1.235
CALC2                            1.302
CALC3                            1.161
CALC4                            1.046
MTRANS2                          1.070
MTRANS3                          1.056
MTRANS4                          1.969
MTRANS5                          1.366
wt_all = lm(Weight ~ . -ageBins - NObeyesdad, data = obs_trn)
vif(wt_all)[vif(wt_all) > 5]
## CAEC2 CAEC3 
## 6.891 5.889

The code above lists the variables deemed to have very high collinearity (VIF > 5). CAEC (consumption of food between meals) has a high VIF. Including these variables might pose some issues in explaining the relationship between the response and the predictors.

It is no surprise that a variable such as BMI would pose a collinearity issue, as it is highly correlated with height and weight (it is actually calculated from them); this is why it is not kept among the predictors.

We retain the CAEC and CALC predictors, as their interactions with the other predictors help generate a suitable model for estimating a person’s weight.
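For intuition, the VIF of a predictor equals \(1/(1-R^2_j)\), where \(R^2_j\) is obtained by regressing that predictor on all the others. A minimal sketch for Height (our own illustration, using the objects defined above):

```r
# Regress Height on the remaining predictors (response and derived columns
# excluded), then convert the resulting R-squared to a VIF.
r2_height = summary(lm(Height ~ . - Weight - NObeyesdad - ageBins,
                       data = obs_trn))$r.squared
vif_height = 1 / (1 - r2_height)
vif_height  # compare with the tabulated value for Height (about 2.1)
```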

Further, we check whether a transformation of the response variable is required. We use the Box-Cox method, which considers a family of power transformations of a positive response variable.

boxcox(obs_mod_add, plotit = TRUE,lambda = seq(-1, 1, by = 0.1))

obs_mod_add_trns=lm((((Weight^0.2)-1)/0.2)~Gender+Age+Height+family_history_with_overweight+FAVC+FCVC+CAEC+SMOKE+FAF+CALC+MTRANS,data=obs_trn)
summary(obs_mod_add_trns)$r.squared
## [1] 0.6506
summary(obs_mod_add_trns)$adj.r.squared
## [1] 0.6459

We observe a slight performance improvement, so we will try this approach on further models.
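One caveat worth noting (our own addition): the transformed model predicts on the Box-Cox scale, so a test RMSE comparable with the other models requires inverting the transform first. A sketch, assuming the objects defined above:

```r
# Invert the Box-Cox transform y = (W^0.2 - 1)/0.2, i.e. W = (0.2*y + 1)^(1/0.2),
# so the test RMSE is expressed in kilograms like the other models.
lambda = 0.2
pred_trans = predict(obs_mod_add_trns, newdata = obs_test)
pred_kg = (lambda * pred_trans + 1)^(1 / lambda)
sqrt(mean((obs_test$Weight - pred_kg)^2))
```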

Model2 with two-way interactions between variables

At this stage we perform the following tasks:

  • Build a model using two-way interactions
  • Variable selection using AIC in both directions
  • Variable selection using BIC in both directions
  • Present scores in tabular format
  • Test the LINE assumptions: Linearity, Independence, Normality and Equal variance
obs_int_mod2=lm(formula = Weight ~ (. - NObeyesdad -ageBins)^2, data = obs_trn)
obs_int_mod2_aic=step(obs_int_mod2,direction="both",trace=0)
obs_int_mod2_bic=step(obs_int_mod2,direction="both",trace=0,k=log(nrow(obs_trn)))
p1=mod_performance(obs_int_mod2_aic)
p2=mod_performance(obs_int_mod2_bic)
p1=append(p1,unlist(list(num_predictors=length(coef(obs_int_mod2_aic)))))
p2=append(p2,unlist(list(num_predictors=length(coef(obs_int_mod2_bic)))))
mod2_stats=data.frame(obs_int_mod2_aic=p1,obs_int_mod2_bic=p2)

Model3 - three-way interactions and stepwise selection

obs_int_mod3=lm(formula = Weight ~ (. - NObeyesdad - ageBins)^3, data = obs_trn)
obs_int_mod3_aic=step(obs_int_mod3,direction="both",trace=0)
obs_int_mod3_bic=step(obs_int_mod3,direction="both",trace=0,k=log(nrow(obs_trn)))
p1=mod_performance(obs_int_mod3_aic)
p2=mod_performance(obs_int_mod3_bic)
p1=append(p1,unlist(list(num_predictors=length(coef(obs_int_mod3_aic)))))
p2=append(p2,unlist(list(num_predictors=length(coef(obs_int_mod3_bic)))))
kable(data.frame(obs_int_mod3_aic=p1,obs_int_mod3_bic=p2))

Model4 - selected interaction terms and outlier diagnostics

obs_int_mod4=lm(Weight ~ Gender + Age + Height + family_history_with_overweight +
    FAVC + FCVC + NCP + CAEC + SMOKE + CH2O + SCC + FAF + TUE +
    CALC + MTRANS +
    FCVC:MTRANS:family_history_with_overweight:Gender:CALC:FAVC:NCP:SCC:Height +
    Gender:Height + Gender:family_history_with_overweight +
    Gender:FCVC + Gender:NCP + Age:FCVC + Age:NCP +
    Age:CH2O + Age:FAF + Age:TUE + Age:CALC +
    Height:family_history_with_overweight +
    family_history_with_overweight:CALC, data = obs_trn)

p4=mod_performance(obs_int_mod4)
p4=append(p4,unlist(list(num_predictors=length(coef(obs_int_mod4)))))
mod4_stats=data.frame(obs_int_mod4=p4)

##         r2      adjr2 Train_rmse  Test_rmse  bptest.BP     sptest 
##  8.409e-01  8.153e-01  1.046e+01  7.726e+01  3.339e-07  8.806e-14

Note: we omitted Models 3 and 4 from our final decision, as they took far too long to execute and did not yield worthwhile results. However, their performance results are included here for reference.

2.5. K-Fold model training

k_fold_modn <- train(
  Weight ~ . - NObeyesdad - ageBins, obs_trn,
  method = "lm",
  trControl = trainControl(
    method = "cv", number = 5,
    verboseIter = FALSE
  )
)
mod_performance(k_fold_modn)

This model’s performance is not as good as the interaction model’s, so we continue with the previously preferred model.

2.6 Model optimization

So far, we have found that Model2 (obs_int_mod2_aic) has the best performance. At this stage we try applying a transformation and removing outliers to see if there are any improvements.

  • Transformations
boxcox(obs_mod_add_aic, plotit = TRUE,lambda = seq(-1, 1, by = 0.1))

obs_mod_int_trans=lm((((Weight^0.2)-1)/0.2)~(.- NObeyesdad - ageBins)^2,data=obs_trn)
boxcox(obs_mod_int_trans, plotit = TRUE)

This shows that the transformation of the response variable was not helpful. Therefore we only apply polynomial transformations to the numerical variables. However, our results for the polynomial transformations were not better than those of the untransformed interaction model:

obs_int_mod2_log_poly_2=lm(formula = Weight ~ (. - NObeyesdad - ageBins + I(Height^2) + I(Age^2))^2, data = obs_trn)
obs_int_mod2_log_poly2_aic=step(obs_int_mod2_log_poly_2,direction="both",trace=0)
obs_int_mod2_log_poly2_bic=step(obs_int_mod2_log_poly_2,direction="both",trace=0,k=log(nrow(obs_trn)))
p1=mod_performance(obs_int_mod2_log_poly2_aic)
p2=mod_performance(obs_int_mod2_log_poly2_bic)
p1=append(p1,unlist(list(num_predictors=length(coef(obs_int_mod2_log_poly2_aic)))))
p2=append(p2,unlist(list(num_predictors=length(coef(obs_int_mod2_log_poly2_bic)))))
mod2_stats=data.frame(obs_int_mod2_log_poly2_aic=p1,obs_int_mod2_log_poly2_bic=p2)
kable(mod2_stats)

Note: we omitted these transformed models from our final decision, as they added significant execution time and did not yield worthwhile results. Their performance metrics showed only very minor differences while adding complexity, which was not desirable in our final model.

Leverage

#We cannot remove the records based on leverage as it's considering critical records as unusual. 
htval=hatvalues(obs_mod_int_trans)
obs_leverages=obs_trn[(htval > 2 * mean(htval)),]

Here we found that the high-leverage records contain factor levels that appear only in the training set. Had we removed these records, the test set would have contained factor levels for some categorical variables that the model never saw during training. Hence we did not remove records with high leverage.
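The kind of level-overlap check involved can be sketched as follows (our own helper, not part of the original pipeline): levels that appear in only one partition signal that the records carrying them cannot simply be dropped.

```r
# For each factor column, count the levels that appear in only one of the two
# partitions (train-only plus test-only levels).
factor_cols = names(Filter(is.factor, obs_trn))
level_gap = function(a, b)
  setdiff(unique(as.character(a)), unique(as.character(b)))
sapply(factor_cols, function(col)
  length(c(level_gap(obs_trn[[col]], obs_test[[col]]),
           level_gap(obs_test[[col]], obs_trn[[col]]))))
```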

Outliers

outliers=cooks.distance(obs_int_mod2_aic)>4/length(resid(obs_int_mod2_aic))
obs_outliers=obs_trn[outliers,]

3. Results

In the previous section we developed multiple models and used various strategies to improve overall model performance. Next, we present our model diagnostics and performance analysis.

3.1 Anova test

Model 1 vs. Model 2 comparison:

  • Model 1 (null model): obs_mod_add, with 30 parameters

  • Model 2 (full model): obs_int_mod2_aic, with 298 parameters

anova(obs_mod_add,obs_int_mod2_aic)[2,"Pr(>F)"]<0.01
## [1] TRUE

We reject the null hypothesis \(H_0\) at significance \(\alpha = 0.01\), so we prefer the interaction model obs_int_mod2_aic over the additive model.

3.2 Model Diagnostics

Model1- Additive model with all the parameters

## 
## Call:
## lm(formula = Weight ~ . - NObeyesdad - ageBins, data = obs_trn)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -57.57 -10.42   1.55   9.91  50.82 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -147.4342    11.0232  -13.37  < 2e-16 ***
## Gender2                           -1.9025     1.1456   -1.66   0.0970 .  
## Age                                0.7665     0.0860    8.92  < 2e-16 ***
## Height                           116.5762     6.1451   18.97  < 2e-16 ***
## family_history_with_overweight2  -17.1378     1.1821  -14.50  < 2e-16 ***
## FAVC2                             -5.8039     1.3608   -4.27  2.1e-05 ***
## FCVC2                              2.0970     1.4078    1.49   0.1365    
## FCVC3                             14.0891     1.5270    9.23  < 2e-16 ***
## NCP2                               2.4972     1.4729    1.70   0.0902 .  
## NCP3                               2.4010     1.1191    2.15   0.0321 *  
## NCP4                               0.0234     2.5966    0.01   0.9928    
## CAEC2                              3.6112     2.8116    1.28   0.1992    
## CAEC3                            -15.6130     3.0394   -5.14  3.1e-07 ***
## CAEC4                             -6.5168     3.7434   -1.74   0.0819 .  
## SMOKE2                            -1.7255     2.5729   -0.67   0.5025    
## CH2O2                              2.0069     0.8944    2.24   0.0250 *  
## CH2O3                              0.9379     1.7623    0.53   0.5946    
## SCC2                               5.6855     2.0973    2.71   0.0068 ** 
## FAF1                               2.3175     0.9211    2.52   0.0120 *  
## FAF2                             -12.2188     1.3441   -9.09  < 2e-16 ***
## FAF3                              -4.6587     2.3703   -1.97   0.0495 *  
## TUE1                              -6.0260     0.9731   -6.19  7.6e-10 ***
## TUE2                              -6.0588     1.9519   -3.10   0.0019 ** 
## CALC2                              4.8716     0.9567    5.09  4.0e-07 ***
## CALC3                              3.5995     2.4021    1.50   0.1342    
## CALC4                             10.6735    16.0250    0.67   0.5055    
## MTRANS2                           10.9665     5.4175    2.02   0.0431 *  
## MTRANS3                           -1.7875     9.3033   -0.19   0.8477    
## MTRANS4                           10.5986     1.2712    8.34  < 2e-16 ***
## MTRANS5                            0.6782     2.7988    0.24   0.8086    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.7 on 1553 degrees of freedom
## Multiple R-squared:  0.65,   Adjusted R-squared:  0.643 
## F-statistic: 99.3 on 29 and 1553 DF,  p-value: <2e-16

The test statistics’ p-values show that most of the variables are important for the model.

Note: we have removed the derived variable ageBins and the classification label NObeyesdad.

##         r2      adjr2 Train_rmse  Test_rmse  bptest.BP     sptest 
##  6.497e-01  6.431e-01  1.552e+01  1.669e+01  1.511e-28  2.229e-08
kable(mod1_stats)
                obs_mod_add_aic  obs_mod_add_bic
r2                       0.6483           0.6451
adjr2                    0.6427           0.6403
Train_rmse              15.5452          15.6172
Test_rmse               16.7503          16.7521
bptest.BP                0.0000           0.0000
sptest                   0.0000           0.0000
num_predictors          26.0000          22.0000
                obs_mod_add  obs_int_mod2_aic
r2                   0.6497            0.8750
adjr2                0.6431            0.8512
Train_rmse          15.5158            9.2679
Test_rmse           16.6875           16.2737
bptest.BP            0.0000            0.0000
sptest               0.0000            0.0000
num_predictors      30.0000          298.0000

We also checked whether a transformation of the response variable is required, using the Box-Cox family of transformations for a positive response.

Model2 with two-way interactions between variables

                obs_int_mod2_aic  obs_int_mod2_bic
r2                        0.8750            0.8179
adjr2                     0.8512            0.8066
Train_rmse                9.2679           11.1848
Test_rmse                16.2737           13.6781
bptest.BP                 0.0000            0.0000
sptest                    0.0000            0.0000
num_predictors          298.0000          109.0000

Observations are:

  • The 2nd model, obtained from the two-way interaction model via an AIC stepwise search, has a higher \(R^2\) and lower RMSE on both the train and test datasets
  • Both models reject the null hypothesis of equal variance (Breusch-Pagan test)
  • Both models reject the null hypothesis of normality (Shapiro-Wilk test)
  • The winning model has 109 predictors

On the Residuals vs. Fitted plot, the residuals do not bounce evenly on both sides of the horizontal zero line, so we suspect the linearity assumption. We also suspect that this model violates the constant-variance assumption, as the spread of the residuals changes as the fitted values increase.

On the Q-Q plot, the points deviate from the line in the tails, so we suspect the normality assumption is violated; the residuals do not follow a normal distribution.

  • Residuals vs. Fitted plot: linearity is suspect
  • Residuals vs. Fitted plot: equal variance is suspect
  • Q-Q plot: normality is suspect

3.3 The big picture from all models

                obs_int_mod2_aic  obs_int_mod2_bic  obs_mod_add_aic  obs_mod_add_bic
r2                        0.8750            0.8179           0.6483           0.6451
adjr2                     0.8512            0.8066           0.6427           0.6403
Train_rmse                9.2679           11.1848          15.5452          15.6172
Test_rmse                16.2737           13.6781          16.7503          16.7521
bptest.BP                 0.0000            0.0000           0.0000           0.0000
sptest                    0.0000            0.0000           0.0000           0.0000
num_predictors          298.0000          109.0000          26.0000          22.0000

It is interesting to see how the complex model overfits the data. The AIC-selected interaction model has 298 coefficients and the best \(R^2\) and training RMSE, yet the large gap between its training RMSE (9.27) and its test RMSE (16.27) shows overfitting, and it is the most complex of all four models. Thus our best selection is the BIC-selected Model2, which has the lowest test RMSE (13.68) with 109 predictors.

3.4 Summary from diagnostics

We used hat values to identify high-leverage points and Cook's distance to detect outliers, as described in the Methods section. We found 196 outliers and 255 high-leverage points.
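
A common way to carry out those checks in R, with the usual rule-of-thumb cutoffs (the exact thresholds used for the counts above may differ), is:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in for the final model

h <- hatvalues(fit)        # leverage of each observation
d <- cooks.distance(fit)   # influence of each observation

# High leverage: hat value above twice the mean leverage.
high_leverage <- which(h > 2 * mean(h))

# Influential outliers: Cook's distance above 4/n (one common cutoff).
influential <- which(d > 4 / nobs(fit))

length(high_leverage); length(influential)
```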

4. Discussion

A model predicting an individual's weight from dietary, social, and physical attributes was trained on the UCI obesity dataset. In preparing the data for model training, we fixed 101 observations whose obesity levels were misclassified. We also identified predictors that posed multicollinearity issues (caec, calc, bmi, and wstatus). We dropped bmi and wstatus and retained caec and calc, as the latter two's interactions with the rest of the predictors improved the resulting models. We kept the existing levels of the mtrans categorical variable instead of collapsing them, after determining that the reduction provided no added benefit.

Here is a summary of the steps we followed, the results of those steps, and our analysis:

kable(mod2_stats)
obs_int_mod2_aic obs_int_mod2_bic
r2 0.8750 0.8179
adjr2 0.8512 0.8066
Train_rmse 9.2679 11.1848
Test_rmse 16.2737 13.6781
bptest.BP 0.0000 0.0000
sptest 0.0000 0.0000
num_predictors 298.0000 109.0000

5. Appendix

Below are models we explored but did not pursue, as their results were not as good, their complexity was too high, and their runtime was far too long to allow further testing in a timely manner.

Model3- three-way interactions and stepwise selection

obs_int_mod3=lm(formula = Weight ~ (. - NObeyesdad - ageBins)^3, data = obs_trn)
obs_int_mod3_aic=step(obs_int_mod3,direction="both",trace=0)
obs_int_mod3_bic=step(obs_int_mod3,direction="both",trace=0,k=log(nrow(obs_trn)))
p1=mod_performance(obs_int_mod3_aic)
p2=mod_performance(obs_int_mod3_bic)
p1=append(p1,unlist(list(num_predictors=length(coef(obs_int_mod3_aic)))))
p2=append(p2,unlist(list(num_predictors=length(coef(obs_int_mod3_bic)))))
kable(data.frame(obs_int_mod3_aic=p1,obs_int_mod3_bic=p2))

Model4- selected interaction terms with outlier diagnostics

obs_int_mod4=lm(Weight ~ Gender + Age + Height + family_history_with_overweight +
    FAVC + FCVC + NCP + CAEC + SMOKE + CH2O + SCC + FAF + TUE +
    CALC + MTRANS + FCVC:MTRANS:family_history_with_overweight:Gender:CALC:FAVC:NCP:SCC:Height +
    Gender:Height + Gender:family_history_with_overweight +
    Gender:FCVC + Gender:NCP + Age:FCVC + Age:NCP +
    Age:CH2O + Age:FAF + Age:TUE + Age:CALC + Height:family_history_with_overweight +
    family_history_with_overweight:CALC, data=obs_trn)

p4=mod_performance(obs_int_mod4)
p4=append(p4,unlist(list(num_predictors=length(coef(obs_int_mod4)))))
mod4_stats=data.frame(obs_int_mod4=p4)
kable(mod4_stats)

obs_int_mod4
r2 8.409e-01
adjr2 8.153e-01
Train_rmse 1.046e+01
Test_rmse 7.726e+01
bptest.BP 3.339e-07
sptest 8.806e-14

Log Response Transformation with All 2-Way Interactions and Stepwise

obs_int_mod2_log=lm(formula = log(Weight) ~ (. - NObeyesdad)^2, data = obs_trn)
obs_int_mod2_log_aic=step(obs_int_mod2_log,direction="both",trace=0)
obs_int_mod2_log_bic=step(obs_int_mod2_log,direction="both",trace=0,k=log(nrow(obs_trn)))
p1=mod_performance(obs_int_mod2_log_aic)
p2=mod_performance(obs_int_mod2_log_bic)
p1=append(p1,unlist(list(num_predictors=length(coef(obs_int_mod2_log_aic)))))
p2=append(p2,unlist(list(num_predictors=length(coef(obs_int_mod2_log_bic)))))
mod2_stats=data.frame(obs_int_mod2_log_aic=p1,obs_int_mod2_log_bic=p2)
kable(mod2_stats)

Log Response, Polynomial Predictors Transformation with All 2-Way Interactions and Stepwise

obs_int_mod2_log_poly_2=lm(formula = log(Weight) ~ (. - NObeyesdad + I(Height^2) + I(Age^2))^2, data = obs_trn)
obs_int_mod2_log_poly2_aic=step(obs_int_mod2_log_poly_2,direction="both",trace=0)
obs_int_mod2_log_poly2_bic=step(obs_int_mod2_log_poly_2,direction="both",trace=0,k=log(nrow(obs_trn)))
p1=mod_performance(obs_int_mod2_log_poly2_aic)
p2=mod_performance(obs_int_mod2_log_poly2_bic)
p1=append(p1,unlist(list(num_predictors=length(coef(obs_int_mod2_log_poly2_aic)))))
p2=append(p2,unlist(list(num_predictors=length(coef(obs_int_mod2_log_poly2_bic)))))
mod2_stats=data.frame(obs_int_mod2_log_poly2_aic=p1,obs_int_mod2_log_poly2_bic=p2)
kable(mod2_stats)

Polynomial Transformations with All 2-Way Interactions and Stepwise

obs_int_mod2_poly_2=lm(formula = Weight ~ (. - NObeyesdad + I(Height^2) + I(Age^2))^2, data = obs_trn)
obs_int_mod2_poly2_aic=step(obs_int_mod2_poly_2,direction="both",trace=0)
obs_int_mod2_poly2_bic=step(obs_int_mod2_poly_2,direction="both",trace=0,k=log(nrow(obs_trn)))
p1=mod_performance(obs_int_mod2_poly2_aic)
p2=mod_performance(obs_int_mod2_poly2_bic)
p1=append(p1,unlist(list(num_predictors=length(coef(obs_int_mod2_poly2_aic)))))
p2=append(p2,unlist(list(num_predictors=length(coef(obs_int_mod2_poly2_bic)))))
mod2_stats=data.frame(obs_int_mod2_poly2_aic=p1,obs_int_mod2_poly2_bic=p2)
kable(mod2_stats)

Note: we omitted these transformed models from our final decision, as they added significant execution time and did not yield worthwhile results. Their performance metrics differed only marginally, while adding complexity that was not desirable in our final model.

References

[1]. Fabio Mendoza Palechor; Alexis de la Hoz Manotas, Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico. Data in Brief, Volume 25, 2019,104344.

[2]. Obesity and overweight, World Health Organization fact sheet.

[3]. Estimation of obesity levels based on eating habits and physical condition Data Set, UCI Machine Learning Repository

[4]. Eduardo De-La-Hoz-Correa, Fabio E. Mendoza-Palechor, Alexis De-La-Hoz-Manotas, Roberto C. Morales-Ortega and Sánchez Hernández Beatriz Adriana, Obesity Level Estimation Software based on Decision Trees

[5]. Obesity: a public health problem, Magazine of Scientific and Technological Disclosure of the Veracruzana University.

[6]. Body Mass Index calculator